packages <- c(
"acidAdaptedRNAseq",
"GO.db",
"GOSim",
"igraph",
"tidyverse",
"ggraph",
"multipanelfigure",
"grid",
"gtable",
"magrittr",
"printr"
)
purrr::walk(packages, library, character.only = TRUE)
## Warning: replacing previous import 'AnnotationDbi::select' by
## 'dplyr::select' when loading 'acidAdaptedRNAseq'
rm(packages)
#source('~/Github/diverseScripts/R/scripts/largeDivergingPalettes.R')
Previously we have performed a gene ontology (GO) term enrichment analysis using the differentially expressed genes (adjusted p-value < 0.05). Due to the fact that many of the significant GO terms are related to similar biological processes, it is desireable to facilitate the understanding of the results to combine similar terms into groups, thus minimizing the number of terms to be simultaniously considered.
In this analysis we use comminuty detection in order to reduce terms into groups but still strive to maintain the majority of the information concerning the relationship of GO terms to each other. In addition, we highlight the differences in gene expression between the experimental conditions in the detected communities. Finally, we show the specific expression of a subset of the genes in some of the GO terms in several of the communities.
#load GO result and subset signignificant terms
data(topGOresult, package = "acidAdaptedRNAseq")
sig <- topGOresult[topGOresult$classicFisher < 0.05, ]
To understand the relationship between the terms found to be significant in the GO term enrichment analysis, we first download the GO graph for all of the terms. This returns a graph of all the significant terms as well as their ancestors.
We then proceed to perform community detection and collapse the graph and simplify the edges. For the community detection we use the spin-glass algorithm. From the available algorithms in the igraph package, a subset of these are not appropriate for directed graphs and these were not evaluated. Those which are appropriate for directed graphs were all tested and evaluated individually by a) the number of communities detected and b) the biological relevance of the communities detected. a) was desired to be the smallest number possible without have a strong negative impact on b). b) was evaluated manually by reviewing the GO terms present in each commuity and their biological relationship to each other. Although both the infomap and spinglass community detection method performed well for a), it was determined that the spinglass algorithm performed better for b) and was therefore implemented in the analysis.
A resonably detailed summary of the community detection algorithms available in the igraph package can be found at: https://stackoverflow.com/questions/9471906/what-are-the-differences-between-community-detection-algorithms-in-igraph
Collapsing the graph on communities results in each community being represented by one node. Furthermore, in each detected community the node with the highest authority score is determined. Due to the orientation of the graph, this node represents the term in the community that has the most incomming edges, also known as the node with the highest indegree, and is thus the node closest to the root of the graph (i.e. the “all” term). In the downstream analysis, the node with the highest authority score is used to represent the community. In addition, in the case where there were edges between the terms now located in communities, a single edge between the communities that those terms are now located in, is used to represent this relationship.
#get GO graph
g <- downloadGOgraph(sig$GO.ID)
#run community analysis
tmp <- collapseGraph(g)
cg <- tmp[[1]]
comSummary <- tmp[[2]] %>%
left_join(select(sig, GO.ID, classicFisher), by=c("GOID" = "GO.ID"))
The community detection algorithm detected a total of 39 communities. Below we calculate and show the number of genes included in each community as an indication of the communities information content. The results indicate that the smallest community is comprised of only 5 genes (lymphocyte apoptotic process) whereas the largest community (cellular metabolic process) has just under 30000.
#calculate genes per term
uCommunityNames <- unique(comSummary$communityName)
genesPerCom <- lapply(1:length(uCommunityNames), function(x) {
bool <- comSummary$communityName == uCommunityNames[x]
idsToGet <- comSummary[bool, ]$GOID
genesForIdsString <- sig[sig$GO.ID %in% idsToGet, "ID"]
genesForIdsList <- strsplit(genesForIdsString, ", ")
length(unlist(genesForIdsList))
})
names(genesPerCom) <- uCommunityNames
namedListToTibble(genesPerCom) %>%
setNames(c("communityName", "nrCommunityGenes")) %>%
inner_join(distinct(select(comSummary, communityName, communityTerm))) %>%
select(communityTerm, nrCommunityGenes, -communityName)
| communityTerm | nrCommunityGenes |
|---|---|
| proteolysis | 1531 |
| cell proliferation | 1229 |
| nucleic acid phosphodiester bond hydrolysis | 456 |
| single-organism cellular process | 3815 |
| cellular metabolic process | 28991 |
| positive regulation of B cell activation | 236 |
| autophagy | 55 |
| protein heterotetramerization | 48 |
| regulation of cellular process | 16067 |
| somatic recombination of immunoglobulin genes involved in immune response | 1088 |
| cell differentiation | 705 |
| single-multicellular organism process | 1097 |
| lipid metabolic process | 170 |
| Wnt signaling pathway | 825 |
| cell adhesion | 62 |
| regulation of biological quality | 467 |
| carboxylic acid metabolic process | 253 |
| protein phosphorylation | 5040 |
| single-organism developmental process | 4643 |
| negative regulation of gene expression | 2348 |
| cellular component organization | 1379 |
| multi-organism process | 2926 |
| action potential | 41 |
| detoxification | 3962 |
| lymphocyte apoptotic process | 5 |
| protein refolding | 116 |
| neuron projection development | 224 |
| intracellular signal transduction | 1152 |
| response to cytokine | 285 |
| sulfur compound metabolic process | 249 |
| antigen processing and presentation of exogenous peptide antigen via MHC class I, TAP-independent | 69 |
| cell cycle | 2843 |
| ameboidal-type cell migration | 717 |
| cell death | 2853 |
| generation of precursor metabolites and energy | 30 |
| DNA duplex unwinding | 42 |
| secretion | 831 |
| RNA biosynthetic process | 175 |
| translation | 572 |
As a measure of the effectivity of the community detection we can review the dissimilarity between communities by first calculating the Jaccard similarity between communities and then performing hierarchical clustering using 1-similarity. The results (below) indicate that the majority of communities have a reasonable level of dissimilarity with the exception of the protein refolding and antigen processing and presentation of exogenous… community.
cgSim <- similarity(cg, mode = "all", method = "jaccard")
colnames(cgSim) <- unique(
comSummary[match(V(cg)$name, comSummary$communityID), ]$communityTerm
)
rownames(cgSim) <- unique(
comSummary[match(V(cg)$name, comSummary$communityID), ]$communityTerm
)
heatmap(1 - cgSim, margins = c(20,20))
#adds attributes to the collapsed graph
cg <- addAttributes1(cg, comSummary)
To give a graphical representation of the size of each community, we now add all of the terms found to be significant in the GO term enrichment analysis back to the collapsed graph. In addition, we draw an edge from each individual term to the community that is is a member of.
It should be noted that now the graph only contains terms found to be significant in the GO term enrichment analysis with the exception of the community nodes. Due to the fact that the authority score on the whole graph was used to determine the community nodes, it can be the case that the community node itself was not actually found to be a significant term. Despite this, each of the terms in the community were significantly enriched. Therefore the community nodes should be viewed as a summary or label encasing the significant terms rather than a significant term itself (although in some cases it is).
#add vertices and edges from uncollapsed graph
toAdd <- comSummary %>%
filter(!GOID %in% communityID) %>%
filter(classicFisher < 0.05) %>%
select(GOID, communityID) %>%
setNames(c("from", "to"))
cg <- addBackVertexAndEdges(cg, toAdd)
#add a line break in long community groups so the plot legend can be seen
comSummary$communityTerm <- ifelse(
comSummary$communityTerm == "antigen processing and presentation of exogenous peptide antigen via MHC class I, TAP-independent",
"antigen processing and presentation of exogenous\npeptide antigen via MHC class I, TAP-independent",
comSummary$communityTerm
)
comSummary$communityTerm <- ifelse(
comSummary$communityTerm == "somatic recombination of immunoglobulin genes involved in immune response",
"somatic recombination of immunoglobulin\ngenes involved in immune response",
comSummary$communityTerm
)
comSummary$communityTerm <- ifelse(
comSummary$communityTerm == "generation of precursor metabolites and energy",
"generation of precursor\nmetabolites and energy",
comSummary$communityTerm
)
#add attributes
cg <- addAttributes2(cg, comSummary)
#plot
plotCollapsedGraph(cg, 12, 3)
collapsed <- plotCollapsedGraph(cg)
mylegend <- g_legend(collapsed)
collapsed <- collapsed + theme(legend.position = "none")
A network graph visualizing the results from the community analysis. Edge (line) direction is represented by color with edges originating from a node inheriting that nodes color. Edges between community nodes (large points) indicate that the GO terms representing the nodes are each other’s ancestors or offspring dependent on the direction of the edge. Edges between term nodes (small points) and community nodes indicate the terms inclusion in that community.
In summary, many of the detected communities represent processes which we would expect to be differentially regulated in the experimental conditions (i.e. cell cycle, cell death, cell adhesion). In addition, multiple communities related to metabolism can be identified (i.e. cellular metabolic process, carboxylic acid metabolic process, lipid metabolic process).
It should be noted that some edges that could be expected may not appear. For example, one may expect an edge between cell cycle and cell proliferation, due to the fact that they are biologically related. In this case, and potentially others, the terms are actually connected via an intermediate term (single-orgamism cellular process) and therefore do not have an edge directly connecting the nodes. This is unfortunate for the understanding of the results but has, so far, proved to be unavoidable due to the nature of the analysis.
Due to the fact that the GO term enrichment analysis was performed using significant (alpha < 0.05) differences in gene expression between experimental conditions, we would expect the gene expression in the terms found to be significant (and therefore represented in the graph above) to be different. Despite this, a visualization of this may help to strengthen the belief that the communities in the graph represent true differences in the experimental conditions. To facilitate this understanding we plot the differences in each experimental condition per community below.
#extract all gene IDs per GO term
genePerTerm <- extractGenesFromGO()
#calculate z-score
zNorm <- calculateZscore()
#merge data, note that GO graph terms not in sig are included in comSummary
heatmap <- comSummary %>%
filter(!is.na(classicFisher)) %>%
full_join(genePerTerm, by = "GOID") %>%
left_join(zNorm, by = "ID") %>%
gather(condition, zScore, -communityName,
-communityID, -communitySize, -communityTerm,
-GOID, -authority.score, -Term,
-classicFisher, -ID
)
#run clustering per community to get row order
order <- clusterForOrder(heatmap)
#merge order into summary
heatmap <- left_join(heatmap, order, by = c("communityName", "ID"))
#plot
heatmap$plotCondition <- plotCondRename(heatmap$condition)
heatmap <- plotHeatmap(heatmap)
A heatmap representing gene expression profiles in the detected communities. Communities are indicated by the color bar on the right side of the heatmap. Z-score indicates the regularized log transformed expression values scaled between 0 and 1.
Finally, we wish to highlight genes which would be expected to be disregulated in the experimental model and their location in the community graph. To do this, we extract the expression values for individual genes and plot them in each experimental condition while connecting them each back to the term and community they are located in.
#setup rld expression values for merge
rld <- deResultsRld %>%
as.data.frame() %>%
rownames_to_column() %>%
rename(ID = rowname) %>%
as_tibble() %>%
gather(condition, expression, -ID)
#set up gene p-values for merge
ps <- deResults %>%
select(ID, padj) %>%
as_tibble()
#setup data for plot
barDat <- comSummary %>%
filter(!is.na(classicFisher)) %>%
full_join(genePerTerm, by = "GOID") %>%
left_join(rld, by = "ID") %>%
left_join(geneAnnotation, by = "ID") %>%
left_join(ps, by="ID")
barDat$plotCondition <- plotCondRename(barDat$condition, reps = FALSE)
#Specify terms to plot
terms <- c(
"positive regulation of autophagy",
"response to osmotic stress",
"cell death",
"positive regulation of metabolic process",
"negative regulation of cell differentiation",
"regulation of protein ubiquitination",
"cell cycle",
"cell proliferation"
)
#plot for rmarkdown
plotGenes(barDat, terms[1])
plotGenes(barDat, terms[2])
plotGenes(barDat, terms[3])
plotGenes(barDat, terms[4])
plotGenes(barDat, terms[5])
plotGenes(barDat, terms[6])
plotGenes(barDat, terms[7])
plotGenes(barDat, terms[8])
Gene expression profiles for selected terms in a subset of the detected communities for parental and acid adapted cells.
#plot for figure
p1 <- plotGenes(barDat, terms[1])
mylegend2 <- g_legend(p1)
p1 <- plotGenes(barDat, terms[1]) + theme(legend.position = "none")
p2 <- plotGenes(barDat, terms[2]) + theme(legend.position = "none")
p3 <- plotGenes(barDat, terms[3]) + theme(legend.position = "none")
p4 <- plotGenes(barDat, terms[4]) + theme(legend.position = "none")
p5 <- plotGenes(barDat, terms[5]) + theme(legend.position = "none")
p6 <- plotGenes(barDat, terms[6]) + theme(legend.position = "none")
p7 <- plotGenes(barDat, terms[7]) + theme(legend.position = "none")
p8 <- plotGenes(barDat, terms[8]) + theme(legend.position = "none")
#set up figure panels
figure1 <- multi_panel_figure(
width = c(200, 100, 100, 100),
height = 350,
rows = 5,
unit = "mm"
)
#plot to panels
figure1 %<>% fill_panel(mylegend, column = 1:4, row = 5, label = "")
figure1 %<>% fill_panel(collapsed, column = 1, row = 1:4, label = "A")
figure1 %<>% fill_panel(heatmap, column = 2, row = 1:4, label = "B")
figure1 %<>% fill_panel(p1, column = 3, row = 1, label = "C")
figure1 %<>% fill_panel(p2, column = 3, row = 2, label = "D")
figure1 %<>% fill_panel(p3, column = 3, row = 3, label = "E")
figure1 %<>% fill_panel(p4, column = 3, row = 4, label = "F")
figure1 %<>% fill_panel(p5, column = 4, row = 1, label = "G")
figure1 %<>% fill_panel(p6, column = 4, row = 2, label = "H")
figure1 %<>% fill_panel(p7, column = 4, row = 3, label = "I")
figure1 %<>% fill_panel(p8, column = 4, row = 4, label = "J")
#figure1
#save final figure
ggsave(
plot = figure1,
'~/Github/acidAdaptedRNAseq/inst/GOanalysis/figure.pdf',
width = 550,
height = 350,
units = "mm",
limitsize = FALSE
)
ggsave(
plot = mylegend2,
'~/Github/acidAdaptedRNAseq/inst/GOanalysis/legend.pdf'
)
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2 printr_0.1
## [3] magrittr_1.5 gtable_0.2.0
## [5] multipanelfigure_0.10.5 ggraph_1.0.0
## [7] forcats_0.2.0 stringr_1.2.0
## [9] dplyr_0.7.4 purrr_0.2.4
## [11] readr_1.1.1.9000 tidyr_0.7.2
## [13] tibble_1.3.4 ggplot2_2.2.1
## [15] tidyverse_1.2.1 igraph_1.1.2
## [17] GOSim_1.16.0 annotate_1.56.1
## [19] XML_3.98-1.9 GO.db_3.4.1
## [21] AnnotationDbi_1.40.0 IRanges_2.12.0
## [23] S4Vectors_0.16.0 Biobase_2.38.0
## [25] BiocGenerics_0.24.0 acidAdaptedRNAseq_0.0.0.9000
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.13 assertive.base_0.0-7
## [3] colorspace_1.3-2 modeltools_0.2-21
## [5] rprojroot_1.2 corpcor_1.6.9
## [7] rstudioapi_0.7 topGO_2.30.0
## [9] ggrepel_0.7.0 flexmix_2.3-14
## [11] bit64_0.9-7 lubridate_1.7.1
## [13] xml2_1.1.1 codetools_0.2-15
## [15] mnormt_1.5-5 knitr_1.17
## [17] jsonlite_1.5 broom_0.4.3
## [19] cluster_2.0.6 png_0.1-7
## [21] graph_1.56.0 ggforce_0.1.1
## [23] compiler_3.4.2 httr_1.3.1
## [25] backports_1.1.1 assertthat_0.2.0
## [27] Matrix_1.2-12 lazyeval_0.2.1
## [29] cli_1.0.0 tweenr_0.1.5
## [31] htmltools_0.3.6 prettyunits_1.0.2
## [33] tools_3.4.2 glue_1.2.0
## [35] reshape2_1.4.2 rsvg_1.1
## [37] ggthemes_3.4.0 Rcpp_0.12.13
## [39] cellranger_1.1.0 nlme_3.1-131
## [41] udunits2_0.13 assertive.files_0.0-2
## [43] psych_1.7.8 rvest_0.3.2
## [45] org.Hs.eg.db_3.5.0 MASS_7.3-47
## [47] scales_0.5.0 rgeolocate_1.0.1
## [49] hms_0.4.0 RBGL_1.54.0
## [51] SparseM_1.77 yaml_2.1.14
## [53] curl_3.0 memoise_1.1.0
## [55] gridExtra_2.3 biomaRt_2.34.0
## [57] stringi_1.1.6 RSQLite_2.0
## [59] highr_0.6 tiff_0.1-5
## [61] caTools_1.17.1 rlang_0.1.4
## [63] pkgconfig_2.0.1 matrixStats_0.52.2
## [65] bitops_1.0-6 evaluate_0.10.1
## [67] lattice_0.20-35 bindr_0.1
## [69] labeling_0.3 assertive.properties_0.0-4
## [71] tidyselect_0.2.3 bit_1.1-12
## [73] plyr_1.8.4 R6_2.2.2
## [75] DBI_0.7 haven_1.1.0
## [77] foreign_0.8-69 units_0.4-6
## [79] assertive.numbers_0.0-2 RCurl_1.95-4.8
## [81] nnet_7.3-12 modelr_0.1.1
## [83] crayon_1.3.4 assertive.types_0.0-3
## [85] rmarkdown_1.8 jpeg_0.1-8
## [87] viridis_0.4.0 progress_1.1.2
## [89] readxl_1.0.0 blob_1.1.0
## [91] digest_0.6.12 xtable_1.8-2
## [93] liftr_0.7 gridGraphics_0.2
## [95] munsell_0.4.3 viridisLite_0.2.0